Rule-Based Chunker for Croatian

Authors

Kristina Vuckovic

Marko Tadic

Zdravko Dovedan Han

Abstract

In this paper we discuss a rule-based approach to chunking sentences in Croatian, implemented using local regular grammars within the NooJ development environment. We describe the rules and their implementation as regular grammars, and show that the NooJ environment makes it easy to fine-tune their individual sub-rules. Since Croatian has strong morphosyntactic features that are shared between most or all elements of a chunk, the rules are built to rely heavily on these features. To evaluate our chunker we used a set of manually annotated sentences extracted from a 100 kw MSD-tagged and disambiguated Croatian corpus. Our chunker performed best on VP-chunks (F: 97.01), while NP-chunks (F: 92.31) and PP-chunks (F: 83.08) were of lower quality. The results are comparable to chunker performance in the CoNLL-2000 shared task on chunking.
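
The idea of relying on shared morphosyntactic features can be illustrated with a minimal Python sketch. It is not the authors' NooJ grammar: the `Token` fields, the simplified tag values, and the single rule (a run of adjectives followed by a noun, all agreeing in case, number and gender) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    """A tagged token; the tag inventory here is hypothetical and simplified."""
    form: str
    pos: str     # "A" = adjective, "N" = noun, "V" = verb
    case: str    # e.g. "nom", "acc"; empty if not applicable
    number: str  # "sg" or "pl"; empty if not applicable
    gender: str  # "m", "f", "n"; empty if not applicable


def agrees(a: Token, b: Token) -> bool:
    """True if two tokens share case, number and gender."""
    return (a.case, a.number, a.gender) == (b.case, b.number, b.gender)


def np_chunks(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, for the toy rule A* N with agreement."""
    spans = []
    i = 0
    while i < len(tokens):
        j = i
        # Consume adjectives that agree with the first token of the candidate chunk.
        while j < len(tokens) and tokens[j].pos == "A" and agrees(tokens[i], tokens[j]):
            j += 1
        # The chunk is closed by a noun that agrees with the preceding adjectives.
        if j < len(tokens) and tokens[j].pos == "N" and (j == i or agrees(tokens[i], tokens[j])):
            spans.append((i, j + 1))
            i = j + 1
        else:
            i += 1  # no chunk starts here; retry from the next token
    return spans


# "Vidim veliku crvenu kuću" ("I see a big red house")
sentence = [
    Token("Vidim", "V", "", "sg", ""),
    Token("veliku", "A", "acc", "sg", "f"),
    Token("crvenu", "A", "acc", "sg", "f"),
    Token("kuću", "N", "acc", "sg", "f"),
]
print(np_chunks(sentence))
```

In this sketch an adjective whose features do not match the noun is left outside the chunk, which is the kind of decision that agreement-based rules make possible in a morphologically rich language.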

Similar resources

Improving Chunking Accuracy on Croatian Texts by Morphosyntactic Tagging

In this paper, we present the results of an experiment with utilizing a stochastic morphosyntactic tagger as a pre-processing module of a rule-based chunker and partial parser for Croatian in order to raise its overall chunking and partial parsing accuracy on Croatian texts. In order to conduct the experiment, we have manually chunked and partially parsed 459 sentences from the Croatia Weekly 1...

Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and its Automatic Evaluation

phrases. The partial parser is motivated by an intuition (Abney, 1991): To acquire noun phrases from running texts is useful for many applications, such as word grouping, terminology indexing, etc. The reported literatures adopt pure probabilistic approach, or pure rule-based noun phrases grammar to tackle this problem. In this paper, we apply a probabilistic chunker to deciding the implicit bo...

Incorporating Head Recognition into a CRF Chunker

While rule-based shallow parsers usually recognise phrases’ syntactic heads, the same does not hold for statistical syntactic chunkers. The task of finding heads within already recognised chunks is not trivial for freer word order languages like German or Polish, while this information may be very useful. We propose a simple solution that allows to incorporate head recognition into existing chu...

Rule-Based Chunking and Reusability

In this paper we discuss a rule-based approach to chunking implemented using the LT-XML2 and LT-TTT2 tools. We describe the tools and the pipeline and grammars that have been developed for the task of chunking. We show that our rule-based approach is easy to adapt to different chunking styles and that the mark-up of further linguistic information such as nominal and verbal heads can be added to...

POS Tagger and Chunker for Tamil Language

This paper presents the Part Of Speech tagger and Chunker for Tamil using Machine learning techniques. Part Of Speech tagging and chunking are the fundamental processing steps for any language processing task. Part of speech (POS) tagging is the process of labeling automatic annotation of syntactic categories for each word in a corpus. Chunking is the task of identifying and segmenting the text...

Publication date: 2008
